In these exercises, we will go through the dataset you scraped and preprocess it for further analysis. If you couldn’t scrape your own dataset for some reason, you can use the one we provide: RawComments.rds. Should you run into any problems, try the clue boxes or ask us for help. For the solutions, we will be using the prepared dataset. If you use a different one, be sure to replace the names and paths in the code accordingly.
Load the data into your R session and assign it to an object called comments. Get an overview of the contained variables. What do the variables describe? Why do we have missing data in some of them?
To load the data, you can use the readRDS() function. To get an overview of the contained variables, you can simply use colnames(). To find out more about what the variables mean, have a look at the YouTube Data API documentation and search for the comments output description.
# Load data
comments <- readRDS("../data/RawComments.rds")
# overview of columns
colnames(comments)
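To see which variables contain missing data, and how much, it can help to count the NA values per column. The following is a minimal sketch on a small made-up dataframe; the object toy and its values are hypothetical, and only the column names mirror fields from the comments data.

```r
# Toy stand-in for the comments dataframe (hypothetical values)
toy <- data.frame(
  textOriginal = c("first comment", "a reply", "another comment"),
  parentId     = c(NA, "abc123", NA),
  likeCount    = c("3", "0", "12"),
  stringsAsFactors = FALSE
)

# Structure: variable names, classes, and the first values
str(toy)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```

Running the same two calls on comments shows at a glance which variables are only partly filled.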
Remove the variables authorProfileImageUrl, authorChannelUrl, authorChannelId.value, videoId, canRate, viewerRating, and moderationStatus. Create a new dataframe called Selection containing only the remaining variables.
You can use the subset() function to keep or remove a selection of variables from a dataframe. For more information on how to use it, have a look at its documentation by running ?subset().
# selecting only the columns we need
Selection <- subset(comments,select = -c(authorProfileImageUrl,
authorChannelUrl,
authorChannelId.value,
videoId,
canRate,
viewerRating,
moderationStatus))
# Checking Selection
colnames(Selection)
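If the list of columns to drop is built programmatically, it can be easier to work with column names as character strings instead of the unquoted names that subset() expects. The sketch below shows this alternative on a made-up dataframe; toy and drop_cols are hypothetical names used only for illustration.

```r
# Toy dataframe (hypothetical values)
toy <- data.frame(
  id           = 1:3,
  canRate      = TRUE,
  textOriginal = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Names of the columns we want to remove, as character strings
drop_cols <- c("canRate")

# Keep every column whose name is not in drop_cols
kept <- toy[, setdiff(colnames(toy), drop_cols), drop = FALSE]
colnames(kept)
```

The argument drop = FALSE makes sure the result stays a dataframe even if only one column remains.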
Check the class of the variable publishedAt in your new dataframe. Is this class suitable for further analysis? If not, change the class to the appropriate one and compute the time difference in publishing dates between the comment in the first row and the comment in the last row.
Do the same transformation for the variable updatedAt.
To check the class of the publishedAt variable, you can use the class() function. To check the formatting of the comment timestamp, you can check the YouTube API documentation. To transform character strings into datetime objects in R, you can use the base R function as.POSIXct() or the more convenient anytime() function from the package with the same name.
# checking variable class
class(Selection$publishedAt)
# transforming to datetime object
library(anytime)
Selection$publishedAt <- anytime(Selection$publishedAt,asUTC = TRUE)
class(Selection$publishedAt)
# computing time difference in publishing time between first and last comment
Selection$publishedAt[1] - Selection$publishedAt[dim(Selection)[1]]
# Transforming the updatedAt variable as well
Selection$updatedAt <- anytime(Selection$updatedAt,asUTC = TRUE)
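For comparison, the same conversion can be sketched with the base R function as.POSIXct(). The example assumes timestamps of the form "2020-02-10T08:15:30Z"; the values below are made up, and if your timestamps look different (for example with fractional seconds), the format string needs to be adapted.

```r
# Example timestamps in ISO 8601 form (hypothetical values)
stamps <- c("2020-02-10T08:15:30Z", "2020-02-12T20:45:00Z")

# Parse as UTC datetimes; the literal "T" and "Z" are matched by the format string
parsed <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
class(parsed)

# Time difference between the two timestamps, with explicit units
gap <- difftime(parsed[2], parsed[1], units = "hours")
gap
```

Using difftime() with an explicit units argument avoids surprises, because plain subtraction of two datetimes lets R pick the unit automatically.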
Check the likeCount variable in your data. Is it suitable for numeric analysis? If not, transform it to the appropriate class and test whether your transformation worked.
You can use the class() function to check the class of an object in R. To change a class, for example from character to numeric, you can use the family of “as” functions, for example as.numeric().
# Checking variable class
class(Selection$likeCount)
## [1] "character"
# Transforming class
Selection$likeCount <- as.numeric(Selection$likeCount)
# Rechecking class
class(Selection$likeCount)
## [1] "numeric"
summary(Selection$likeCount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 13.43 1.00 4943.00
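as.numeric() turns every string it cannot interpret as a number into NA and issues a warning. One way to test whether the transformation worked is therefore to look for values that were present before the conversion but are NA afterwards. A minimal sketch with made-up values (likes_chr is a hypothetical vector, not part of the dataset):

```r
# Character vector with one entry that is not a number (hypothetical values)
likes_chr <- c("0", "13", "4943", "n/a")

# Convert; the warning about introduced NAs is suppressed here on purpose
likes_num <- suppressWarnings(as.numeric(likes_chr))

# Entries that were not missing before but are NA after the conversion
failed <- is.na(likes_num) & !is.na(likes_chr)
likes_chr[failed]
```

If the last line returns an empty vector for your data, all non-missing values were converted.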
Check the textOriginal column in your Selection dataframe. There are still hyperlinks in the column that we should remove for later text analysis steps. Extract the hyperlinks from the textOriginal column into a new list called Links. In addition, create a new variable called LinksDel that contains the textOriginal but without the hyperlinks.
The qdapRegex package has many pre-built functions for detecting, removing, and replacing specific character strings. You can, for example, use the rm_url() function to extract and remove hyperlinks. As a reminder: You can check the documentation for this function with ?rm_url().
# Loading package
library(qdapRegex)
# Checking column
View(Selection$textOriginal)
# extracting hyperlinks
Links <- rm_url(Selection$textOriginal, extract = TRUE)
head(Links,10)
## [[1]]
## [1] NA
##
## [[2]]
## [1] "https://theobjectivestandard.com/2020/02/justice-for-michael-jackson/\r\nThis"
##
## [[3]]
## [1] NA
##
## [[4]]
## [1] NA
##
## [[5]]
## [1] NA
##
## [[6]]
## [1] NA
##
## [[7]]
## [1] NA
##
## [[8]]
## [1] NA
##
## [[9]]
## [1] NA
##
## [[10]]
## [1] NA
# removing hyperlinks
LinksDel <- rm_url(Selection$textOriginal)
head(LinksDel,10)
## [1] "This whole \"census\" thing sounds like a very easy fucking task do now…."
## [2] "Please Do A Piece on Michael Jackson Hi, I’m writing to you because Last Week Tonight as opposed to other such shows actually cares about issues rather than chasing the headlines. It has been 11 years since Michael Jackson passed away yet to this day what the common consensus is that he is a taboo subject for many. I recall you guys once did a piece on Public Shaming. Michael Jackson was a genius, an abused child. He was strange. He was one of the few major stars from the 80s who came out of the 80s without a heroine addiction. He in his own way did many, Many strange things, but so do most other superstars. And more than others he actually cared. About children, about the earth. About the issues we are discussing to this day. While Icons like Freddy Mercury, Elvis Pressley, Prince, Beetles and many more are known for their good works, Michael is known for the something which he has repeatedly been acquitted for. It’s the truth that anyone looking for will find instantly but due to the “where there is smoke there is a fire” narrative, even 11 years after his death, the new media treats him like a criminal. All his trial pages are open for the public to read. He WAS weird. Making a ranch called neverland, hanging out with children. Trusting people he shouldn’t. But I urge you please cover him, hear beyond the noise like you guys often do. The most successful African American Artist of all time was a humble man child, who respected women, loved children and cared about our environment. He was not a heroine junkie, a private man who did not share his disease even all the way back in 1993 even though he was accused of wanting to become a “White Man”. He was eccentric. Hanging out with animals and caring about them. article covers multiple sources, some of which I had read previously. 
Michael Jackson was a multi talented millionaire pop star, who was not an alcoholic, was a caring father, a filial son, Treated women with respect, cared about the planet and it’s beings. The press that constantly kept DASHING him, had found a way to subvert their guilt. All those years of calling him a “Jacko”, “monkey” and many more hurtful things was justified if he was a paedophile. They NEEDED him to be guilty. Such a man cannot exist in Hollywood. Such public shaming had allowed and to this day allows Michael to be a victim to all this slander. Please do a piece, if not one that exonerates him then one that once and for all cements the fact the Michael Jackson, the greatest pop star, the first African American Idol was a paedophile. Not through unknown sources or flimsy headlines but through concrete proof. A news echoing in a closed chamber will not reach anyone, most fans who what to know the truth know it, other people will read the headlines but not the explanations. It’s about time the general public knows. Please do a piece on Michael, the blatant mistreatment by the media, The systematic racism he faced during his trials. Please don’t let the first African American Singer be remembered for the things he did not do, He was weird, weird enough to annoy Freddy Mercury with Bubbles feedbacks, weird enough to let kids crack raw eggs over Michael Jackson, Weird enough to play water balloons with children. But he was not a paedophile and the world needs to acknowledge that. He was in no way a “Perfect Human” but he tried his best to live right and we should not punish him for doing that. On this year please exonerate this Black man, the Justice System has done it two decades ago, it’s about time everyone else does. Please do a piece on Michael Jackson."
## [3] "Yay! Now illegals don't have to admit that they're illegal even though there's always that \"Prefer not to say\" option at the bottom that most people click when it asks either: \"Are you currently in debt?\" or about your tax situation."
## [4] "Here's an interesting fact about bats, January of 2020: one of them fly-boys is gonna make your year a whole lot more interesting."
## [5] "\"The 2020 census is likely to be more difficult to do budget shortages and republican medaling\" Is it nice to live in the world of bliss see you after the insurrection then tell me how hard it was and the virus is still going by the way."
## [6] "12:50 \"voting rights has never exactly seemed like a top priority for this administration\" That quote is so much more funny on the day Trump's second impeachment trial starts . . ."
## [7] "Good comedy always comes in fours. And thats a dysney fact ahho"
## [8] "funny how trump thought computers would be more accurate for a census, which i agree with, but then went on a rampage about how fraudulent mail in voting is and online voting would be..."
## [9] "Just explain it to me, an unknowing German woman: do amaricans not have to register when moving into a city?"
## [10] "Whenever I see the Cenusman bit I do the \"Pepsiman\" theme but with Cenusman instead."
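Note that the link extracted from the second comment ends in "\r\nThis", which suggests that the match continued across a line break. The sketch below shows a base R variant that stops at any whitespace character. It is not how rm_url() works internally; the pattern url_pattern is a deliberately simple assumption and will not handle every kind of URL (for example trailing punctuation).

```r
# Made-up example texts (hypothetical values)
txt <- c("no link here",
         "see https://example.org/page\r\nThis is the next line",
         "two: http://example.com/a and https://example.org/b")

# Simple assumption: a URL starts with http:// or https://
# and runs until the next whitespace character (including \r and \n)
url_pattern <- "https?://[^[:space:]]+"

# Extract all matches per text (empty character vector if there is none)
links <- regmatches(txt, gregexpr(url_pattern, txt))
links

# Remove the matches from the text
text_no_links <- gsub(url_pattern, "", txt)
text_no_links
```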
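Check the LinksDel variable to see if there are still emoji contained in the column. For our later analysis, we want to do three things: remove the emoji from the text, replace the emoji with their textual descriptions, and extract the emoji descriptions into a separate variable.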
To achieve this, we first need a dictionary of emojis and their corresponding textual descriptions in a usable format. Load the emo package and have a look at the contained dataframe jis. Assign it to a new object called EmojiList. Afterwards, source the provided CamelCase.R script (in the scripts folder) to transform the textual description from regular case into CamelCase. Finally, create a new variable called TextEmoDel containing the text without the emoji (hint: you can use the ji_replace_all() function from the emo package for that).
We provide you with a function that capitalizes the first character of each word. The function is called simpleCap() and the name of the script is CamelCase.R. You can load it into your workspace using the source() function and specifying its location. You can find the function in the scripts folder. Keep in mind that this function only capitalizes the first letter of each word, so you still need to get rid of the space characters. The gsub() function is a handy tool for this purpose.
# loading package
library(emo)
# sourcing script
source("../scripts/CamelCase.R")
# Reassigning dataframe
EmojiList <- jis
# Applying the function to all the names
CamelCaseEmojis <- lapply(jis$name, simpleCap)
# Deleting the empty spaces
CollapsedEmojis <- lapply(CamelCaseEmojis,function(x){gsub(" ", "", x, fixed = TRUE)})
# Formatting back from a list to a vector
EmojiList[,4] <- unlist(CollapsedEmojis)
# Overview of first 10 rows
EmojiList[1:10,c(1,3,4)]
## runes emoji name
## 1 1F600 <U+0001F600> GrinningFace
## 2 1F601 <U+0001F601> BeamingFaceWithSmilingEyes
## 3 1F602 <U+0001F602> FaceWithTearsOfJoy
## 4 1F923 <U+0001F923> RollingOnTheFloorLaughing
## 5 1F603 <U+0001F603> GrinningFaceWithBigEyes
## 6 1F604 <U+0001F604> GrinningFaceWithSmilingEyes
## 7 1F605 <U+0001F605> GrinningFaceWithSweat
## 8 1F606 <U+0001F606> GrinningSquintingFace
## 9 1F609 <U+0001F609> WinkingFace
## 10 1F60A <U+0001F60A> SmilingFaceWithSmilingEyes
# Creating text column with removed Emoji (and hyperlinks)
TextEmoDel <- ji_replace_all(LinksDel,"")
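The contents of CamelCase.R are not shown here, so the following is only a sketch of what such a helper could look like. The function cap_words() is a hypothetical stand-in for simpleCap(), written in base R, and the descriptions are made-up examples.

```r
# Hypothetical stand-in for simpleCap(): capitalize the first letter of each word
cap_words <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  paste(paste0(toupper(substring(words, 1, 1)), substring(words, 2)),
        collapse = " ")
}

# Example descriptions in regular case
descriptions <- c("grinning face", "face with tears of joy")

# Step 1: capitalize each word
capped <- vapply(descriptions, cap_words, character(1), USE.NAMES = FALSE)

# Step 2: remove the spaces to get CamelCase
camel <- gsub(" ", "", capped, fixed = TRUE)
camel
```

The two steps correspond to applying simpleCap() and then collapsing the spaces with gsub() in the solution above.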
Ultimately, we want to use our EmojiList dataframe to replace the instances of emojis in our text with the textual descriptions. We can do that by looping through all emoji in all texts and replacing them one at a time. There is a problem however: Some emojis are made up of multiple “shorter” emojis. If we match part of a “longer” emoji and replace it with its textual description, the rest will become unreadable. For this reason, we need to make sure that we replace the emoji from longest to shortest. Sort the EmojiList dataframe by the length of the emoji column from longest to shortest.
You can count the number of characters in a vector of text using the nchar() function. You can reorder dataframes using the order() function and you can reverse an order using the rev() function.
# ordering from longest to shortest
EmojiList <- EmojiList[rev(order(nchar(EmojiList$emoji))),]
# Overview of new order
head(EmojiList[,c(1,3,4)],5)
## runes
## 1862 1F469 200D 2764 FE0F 200D 1F48B 200D 1F469
## 1860 1F468 200D 2764 FE0F 200D 1F48B 200D 1F468
## 1858 1F469 200D 2764 FE0F 200D 1F48B 200D 1F468
## 3570 1F3F4 E0067 E0062 E0077 E006C E0073 E007F
## 3569 1F3F4 E0067 E0062 E0073 E0063 E0074 E007F
## emoji
## 1862 <U+0001F469><U+200D><U+2764><U+FE0F><U+200D><U+0001F48B><U+200D><U+0001F469>
## 1860 <U+0001F468><U+200D><U+2764><U+FE0F><U+200D><U+0001F48B><U+200D><U+0001F468>
## 1858 <U+0001F469><U+200D><U+2764><U+FE0F><U+200D><U+0001F48B><U+200D><U+0001F468>
## 3570 <U+0001F3F4><U+000E0067><U+000E0062><U+000E0077><U+000E006C><U+000E0073><U+000E007F>
## 3569 <U+0001F3F4><U+000E0067><U+000E0062><U+000E0073><U+000E0063><U+000E0074><U+000E007F>
## name
## 1862 Kiss:Woman,Woman
## 1860 Kiss:Man,Man
## 1858 Kiss:Woman,Man
## 3570 Wales
## 3569 Scotland
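To illustrate why the order matters, here is a small sketch that uses plain text tokens instead of real emoji. The dictionary dict and its entries are made up for this example. It uses order() with decreasing = TRUE, which serves the same purpose as rev(order()), although tied entries may end up in a different order.

```r
# Toy dictionary in which shorter tokens are contained in longer ones
dict <- data.frame(
  token = c(":)", ":))", ":-))"),
  name  = c("Smile", "BigSmile", "HugeSmile"),
  stringsAsFactors = FALSE
)

# Sort from the longest to the shortest token
dict_sorted <- dict[order(nchar(dict$token), decreasing = TRUE), ]
dict_sorted$token

# What goes wrong if the short token is replaced first:
# part of the longer token ":))" is consumed and a stray ")" is left behind
broken <- gsub(":)", "EMOJI_Smile ", "nice :))", fixed = TRUE)
broken
```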
We now have a working dictionary for replacing emojis with a textual description! Create a new variable called TextEmoRep as a copy of the LinksDel variable. Next, loop through the ordered EmojiList and, for every element in TextEmoRep, replace the contained emoji with “EMOJI_” followed by their textual description. You can use the rm_default() function from the qdapRegex package to replace custom patterns. Be sure to check the documentation so you can set the appropriate options for the function.
NB: There will be warnings in your console even if you are doing everything right.
Loop through the dictionary sorted from longest to shortest emoji. You need to use a for loop to go through all emojis for all comments, one at a time. The paste0() function is useful for adding the prefix “EMOJI_” in front of your textual descriptions. Don’t forget to set the arguments fixed = TRUE, clean = FALSE and trim = FALSE in your call to rm_default().
# Assigning the column to a new variable
TextEmoRep <- LinksDel
# switching off warnings
options(warn=-1)
# Looping through all Emojis for all comments in TextEmoRep
for (i in 1:dim(EmojiList)[1]) {
TextEmoRep <- rm_default(TextEmoRep,
pattern = EmojiList[i,3],
replacement = paste0("EMOJI_",
EmojiList[i,4],
" "),
fixed = TRUE,
clean = FALSE,
trim = FALSE)
}
# checking results
LinksDel[159:171]
## [1] "8:01 Donald Trump on Census by computers but won't allow mail in voting?"
## [2] "Did they get clearance for that sound byte?"
## [3] "Census in 2020? Sounds like half a million temp jobs open post corona"
## [4] "If the plans were on a zip drive ; they must have been old, zip drives were only used in the late 90s and early 2000s"
## [5] "Im kinda mostly confused on why they ask for my phone number and email. Sounds like they want to make a buck off our data and so the real question becomes how would we know if they did? Like would we actually be able to hold the census bureau accountable if they broke the law. I dont think the government cares to invade our privacy like that but i am apt to believe they would make a quick buck selling mass user data.and who would call them out for it?"
## [6] "“The 2020 census is likely to be more difficult” You have no idea"
## [7] "man: my name? that's WAY too personal, i can't tell random people that girlfriend: oh- ok. what about your hobbies? man: WHAT THE HECK? HELP! THIS WOMAN IS A PSYCHOPATH STALKER"
## [8] "I wonder what an error this sars 2.0 will create? They should have done the census 2019 unless the purpose is to kill of those, who are not yet in register. So the whole generation of children can just disappear without nobody knowing anything better. 'No, this novel covid does not kill any kids...well how do you know? Nobody has counted the amount of kids less than ten years old! In Nordic countries every baby will be registered and every visitor will be checked outside the EU. But in the case of emergency, the phones are being used. Everybody has a mobile phone, so they can watch how many people are where and how they move around the city. Nobody has to answer anything. They will know it all unless you remove the sim card and the battery away from your phone or slip it into a metal box or so I am being told by Media. Oh by the way, I noticed a new reason the device is not always charging any longer. The device port can have too much dust and stuff in it, which you need to remove by a needle first. Then the charger can get deeper and start to do its job. <U+0001F60E><U+0001F609><U+0001F602>"
## [9] "I think it was pretty brave of that daughter to offer those Zip drives to the media instead of just destroying them immediately or something."
## [10] "I cannot understand why shows like this are “supportive” of illegal immigration. Asking for citizenship should not be an issue, just ask. If you are illegal guess what. YOU ARE DOING SOMETHING ILLEGAL!! You should be in a country illegally. Period."
## [11] "he clearly hasn't heard of a vpn if he wants the computers to do this and also it is kind of a bad idea because if we start doing Census online, people will start to freak out and start to wonder what else are they going to do to hinder my privacy online (of what little we get)."
## [12] "The one where they signed up for the Affordable Care Act - #FriendsReunion"
## [13] "5 months ago: “there is a lot working against this census.” Two months later: PANDEMIC! As if the census needed more discouragement."
TextEmoRep[159:171]
## [1] "8:01 Donald Trump on Census by computers but won't allow mail in voting?"
## [2] "Did they get clearance for that sound byte?"
## [3] "Census in 2020? Sounds like half a million temp jobs open post corona"
## [4] "If the plans were on a zip drive ; they must have been old, zip drives were only used in the late 90s and early 2000s"
## [5] "Im kinda mostly confused on why they ask for my phone number and email. Sounds like they want to make a buck off our data and so the real question becomes how would we know if they did? Like would we actually be able to hold the census bureau accountable if they broke the law. I dont think the government cares to invade our privacy like that but i am apt to believe they would make a quick buck selling mass user data.and who would call them out for it?"
## [6] "“The 2020 census is likely to be more difficult” You have no idea"
## [7] "man: my name? that's WAY too personal, i can't tell random people that girlfriend: oh- ok. what about your hobbies? man: WHAT THE HECK? HELP! THIS WOMAN IS A PSYCHOPATH STALKER"
## [8] "I wonder what an error this sars 2.0 will create? They should have done the census 2019 unless the purpose is to kill of those, who are not yet in register. So the whole generation of children can just disappear without nobody knowing anything better. 'No, this novel covid does not kill any kids...well how do you know? Nobody has counted the amount of kids less than ten years old! In Nordic countries every baby will be registered and every visitor will be checked outside the EU. But in the case of emergency, the phones are being used. Everybody has a mobile phone, so they can watch how many people are where and how they move around the city. Nobody has to answer anything. They will know it all unless you remove the sim card and the battery away from your phone or slip it into a metal box or so I am being told by Media. Oh by the way, I noticed a new reason the device is not always charging any longer. The device port can have too much dust and stuff in it, which you need to remove by a needle first. Then the charger can get deeper and start to do its job. EMOJI_SmilingFaceWithSunglasses EMOJI_WinkingFace EMOJI_FaceWithTearsOfJoy "
## [9] "I think it was pretty brave of that daughter to offer those Zip drives to the media instead of just destroying them immediately or something."
## [10] "I cannot understand why shows like this are “supportive” of illegal immigration. Asking for citizenship should not be an issue, just ask. If you are illegal guess what. YOU ARE DOING SOMETHING ILLEGAL!! You should be in a country illegally. Period."
## [11] "he clearly hasn't heard of a vpn if he wants the computers to do this and also it is kind of a bad idea because if we start doing Census online, people will start to freak out and start to wonder what else are they going to do to hinder my privacy online (of what little we get)."
## [12] "The one where they signed up for the Affordable Care Act - #FriendsReunion"
## [13] "5 months ago: “there is a lot working against this census.” Two months later: PANDEMIC! As if the census needed more discouragement."
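The logic of the replacement loop can also be sketched in base R with gsub(), without any additional package. This is not the rm_default() solution from above, only an illustration of the same idea on made-up data; dict and texts are hypothetical objects, and plain text tokens stand in for real emoji.

```r
# Toy dictionary, already sorted from longest to shortest token
dict <- data.frame(
  token = c(":))", ":)"),
  name  = c("BigSmile", "Smile"),
  stringsAsFactors = FALSE
)

# Made-up example texts
texts <- c("nice :))", "ok :)", "no token")

# Replace one dictionary entry at a time in all texts
replaced <- texts
for (i in seq_len(nrow(dict))) {
  replaced <- gsub(dict$token[i],
                   paste0("EMOJI_", dict$name[i], " "),
                   replaced,
                   fixed = TRUE)  # treat the token as literal text, not as a regex
}
replaced
```

As in the output above, each replacement is followed by a space so that consecutive emoji remain separate tokens.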
We now have the original text column (textOriginal), the text with hyperlinks and emojis removed (TextEmoDel), and the text with hyperlinks removed and emojis replaced by their textual descriptions (TextEmoRep). We need one more variable that only contains the textual descriptions of the emojis. You can use our predefined function ExtractEmoji() from the scripts folder to create this variable.
Use the source() function to source the ExtractEmoji.R script from the scripts folder and then sapply() the ExtractEmoji() function to the variable TextEmoRep. To remove useless row names from the extracted emojis, you can set names(Emoji) to NULL.
# sourcing function
source("../scripts/ExtractEmoji.R")
# Using function
Emoji <- sapply(TextEmoRep,ExtractEmoji)
names(Emoji) <- NULL
# checking results
TextEmoRep[39]
## [1] "15:46 \"... Despite president Trump's decision to back off from the citizenship question...\" You mean when he flat out *LOST HIS CASE IN FRONT OF THE SUPREME COURT?!?* This is the problem with modern news media: no matter whose side they claim to be on, they simply cannot bring themselves to tell the truth."
Emoji[39]
## [1] "NA"
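The ExtractEmoji.R script is not shown here. To give an idea of what such a function might do, the following sketch collects all tokens that start with “EMOJI_” from each text. The function extract_tokens() is a hypothetical stand-in and may behave differently from the provided ExtractEmoji(); for instance, it returns a real missing value instead of the string "NA" when a text contains no emoji.

```r
# Hypothetical stand-in for ExtractEmoji(): find all "EMOJI_..." tokens in one text
extract_tokens <- function(x) {
  hits <- regmatches(x, gregexpr("EMOJI_[^[:space:]]+", x))[[1]]
  if (length(hits) == 0) NA_character_ else hits
}

# Made-up example texts
texts <- c("great video EMOJI_WinkingFace EMOJI_FaceWithTearsOfJoy ",
           "no emoji here")

# One list element per text, each holding zero or more emoji descriptions
tokens <- lapply(texts, extract_tokens)
tokens
```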
We now have selected all the variables we need, brought them into the right formats, cleaned the text, and extracted some additional information from it. As a final step, create a new dataframe called df that contains the following variables:
Selection$authorDisplayName
Selection$textOriginal
TextEmoRep
TextEmoDel
Emoji
Selection$likeCount
Links
Selection$publishedAt
Selection$updatedAt
Selection$parentId
Selection$id
Set the following names for the columns in the new dataframe:
Author
Text
TextEmojiReplaced
TextEmojiDeleted
Emoji
LikeCount
URL
Published
Updated
ParentId
CommentID
Save the new dataframe as an RDS object with the name “ParsedComments.rds”.
You can use the cbind.data.frame() function to paste together multiple columns into a dataframe. You need to set the argument stringsAsFactors = FALSE though (if your R version is < 4.0.0), to prevent strings from being interpreted as factor variables. In addition, the variables Links and Emoji are lists and can contain multiple values per row. For this reason, you need to enclose them with the I() function to be able to put them into a dataframe. You can save your result using the saveRDS() function.
# creating df dataframe (use I() function to enclose Emoji and Links)
df <- cbind.data.frame(Selection$authorDisplayName,
Selection$textOriginal,
TextEmoRep,
TextEmoDel,
I(Emoji),
Selection$likeCount,
I(Links),
Selection$publishedAt,
Selection$updatedAt,
Selection$parentId,
Selection$id,
stringsAsFactors = FALSE)
# setting column names
names(df) <- c("Author",
"Text",
"TextEmojiReplaced",
"TextEmojiDeleted",
"Emoji",
"LikeCount",
"URL",
"Published",
"Updated",
"ParentId",
"CommentID")
# deleting row names
row.names(df) <- NULL
# saving dataframe
saveRDS(df, file = "../data/ParsedComments.rds")
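The role of I() and the round trip through saveRDS() and readRDS() can be tried out on a small example first. The sketch below uses made-up data and a temporary file; links, toy, and path are hypothetical names.

```r
# A list with a varying number of values per row (hypothetical values)
links <- list(NA,
              c("https://example.org/a", "https://example.org/b"),
              NA)

# I() keeps the list as a single column with one list element per row
toy <- data.frame(Text = c("one", "two", "three"),
                  URL  = I(links),
                  stringsAsFactors = FALSE)

# Save to a temporary file and read it back in
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)
restored <- readRDS(path)

# Number of values stored in each row of the list column
n_links <- vapply(restored$URL, length, integer(1))
n_links
```

Without I(), data.frame() would try to spread the list elements across several columns and fail because they differ in length.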